

Vital Elements of Calculus Series

Part 2: Derivatives at a point and the Numerical Differentiator

In the previous post we described in words and pictures what the derivative at a point is - in this post we get more formal and describe these ideas mathematically and programmatically.

In [1]:
# imports from custom library
import sys
sys.path.append('../../')
from mlrefined_libraries import calculus_library as calclib
from mlrefined_libraries import basics_library as baslib

# import autograd
import autograd.numpy as np
import matplotlib.pyplot as plt

1. How do we define the derivative driven tangent line?

We will start our discussion with functions that take in only one input like the familiar sinusoidal function

\begin{equation} g(w) = \text{sin}(w) \end{equation}

which takes in the single input $w$ (we generalize afterwards to functions that take in more than one input).

Remember what we said in words / pictures previously about the derivative of a function at a point: the derivative at a point defines a line that is always tangent to a function, encodes its steepness at that point, and generally matches the underlying function near the point locally. In other words: the derivative at a point is the slope of the tangent line there.

The derivative at a point is the slope of the tangent line at that point.

How can we more formally describe such a tangent line and derivative?

1.1 Secant lines

In the image below we show a picture of the sinusoid in the left panel, where we have plugged the input point $w_0 = 0$ into the sinusoid and highlighted <font color = #32cd32> the corresponding point $(0, \text{sin}(0))$ in green </font>. In the middle panel we plot another point on the curve - with input $w_1 = -2.6$ <font color = 'blue' > the point $(-2.6, \text{sin}(-2.6) ) $ in blue </font>, and <font color = 'red'> the secant line in red </font> formed by connecting <font color = 'blue'> $(-2.6, \text{sin}(-2.6) ) $ </font> and <font color = #32cd32> $(0, \text{sin}(0))$ </font>. Finally in the right panel we show <font color = #32cd32> the tangent line at $w = 0$ in lime green. </font> The <font color = 'gray' > gray vertical dashed lines </font> in the middle panel are there for visualization purposes only.

A secant line is just a line formed by taking any two points on a function - like our sinusoid - and connecting them with a straight line. On the other hand, while a tangent line can cross through several points of a function it is explicitly defined using only a single point. So in short - a secant line is defined by two points, a tangent line by just one.

The equation of any secant line is easy to derive - since all we need to define it is the slope and any point on the line - and the slope of a line can be found using any two points on it (like the two points we used to define the secant to begin with).

The slope - the line's 'steepness' or 'rise over run' - is the ratio of the change in output $g(w)$ to the change in input $w$. Using two generic inputs $w_0$ and $w_1$ - above we chose $w_0 = 0$ and $w_1 = -2.6$ - we can write the slope of a secant line generally as

\begin{equation} \text{slope of a secant line} = \frac{g(w_1) - g(w_0)}{w_1 - w_0} \end{equation}

Now using the point-slope form of a line we can directly write out the equation of a secant using the slope above and either of the two points we used to define the secant to begin with - using $(w_0, g(w_0))$ the equation of the secant line $h(w)$ is

\begin{equation} h(w) = g(w_0) + \frac{g(w_1) - g(w_0)}{w_1 - w_0}(w - w_0) \end{equation}

If we think about our <font color = #32cd32> green point </font> at $w_0 = 0$ as fixed, then the tangent line at this point can be thought of as the line we get when we shift the <font color = 'blue' > blue point </font> very close - infinitely close actually - to the green one.

Example. Secant line computation

Taking $w_0 = 0$ and $w_1 = -2.6$ the equation of the secant line connecting $(w_0,\text{sin}(w_0))$ and $(w_1,\text{sin}(w_1))$ on the sinusoid is given as

$$h(w) = \text{sin}(0) + \frac{\text{sin}(-2.6) - \text{sin}(0)}{-2.6 - 0}(w - 0)$$

Since $\text{sin}(0) = 0$ and $\text{sin}(-2.6) \approx -0.5155$ we can write this as

$$h(w) = \frac{0.5155}{2.6}w$$
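This secant computation can be reproduced in a few lines of plain NumPy - a quick sketch, independent of the custom plotting library used above:

```python
import numpy as np

# the function and the two inputs defining the secant
g = lambda w: np.sin(w)
w0, w1 = 0.0, -2.6

# slope of the secant: rise over run between the two points
slope = (g(w1) - g(w0)) / (w1 - w0)

# point-slope form of the secant line h(w)
h = lambda w: g(w0) + slope * (w - w0)

print(slope)  # approximately 0.5155 / 2.6 = 0.1983
```

By construction the line $h$ passes exactly through both points used to define it.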

1.2 From secant to tangent line

The next Python cell activates a slider-based animation widget that illustrates precisely this idea. As you shift the slider from left to right the <font color = 'blue'> blue point </font> - along with the <font color = 'red'> red secant line </font> that passes through it and the <font color = #32cd32 > green point </font> - moves closer and closer to our fixed point. Finally - when the two points lie right on top of each other - the <font color = 'red'> secant line </font> becomes the <font color = #32cd32> green tangent line </font> at our fixed point.

In [2]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.sin(w)

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200)
Out[2]: (interactive slider widget)



In sliding back and forth, notice how it does not matter if we start from the left of our fixed point and move right towards it, or start to the right of the fixed point and move left towards it: either way the secant line gradually becomes tangent to the curve at $w_0 = 0$. There is no big 'jump' in the slope of the line if we wiggle the slider ever so slightly to the left or right of the fixed point - the slopes of the nearby secant lines are very very similar to that of the tangent.

When we can do this - come at a fixed point from either the left or the right and the secant line becomes tangent smoothly from either direction with no jump in the value of the slope - we say that a function has a derivative at this point, or likewise say that it is differentiable at the point.

If the slope of the secant line varies gradually - with no visible jumps - from both the left and right of a fixed point on a function, we say that a function has a derivative at this point, or likewise say that it is differentiable at the point. A function that has a derivative at every point is called differentiable.

Example. The hyperbolic tangent, squared

Many functions like our sinusoid, other trigonometric functions, and polynomials are differentiable at every point - or just differentiable for short. You can tinker around with the previous Python cell - pick another fixed point! - and see this for yourself. You can also tinker around with the function - for example in the next cell we show - using the same slider mechanism - that the function

\begin{equation} g(w) = \text{tanh}(w)^2 \end{equation}

has a derivative at the point $w_0 = 1$.

In [3]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.tanh(w)**2

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 1, num_frames = 300)
Out[3]: (interactive slider widget)



Example. An example of failure: the rectified linear unit

Notice that the requirement that the slope of the secant line change smoothly into the slope of the tangent line from both directions - from both the left and the right - is important to this definition. There are plenty of functions where this does not occur at every point, like the function

\begin{equation} g(w) = \text{max}(0,w) \end{equation}

at the point $w_0 = 0$. This function is called a rectified linear unit or relu for short. Using the slider widget we can see that the slope of the secant line visibly jumps at this point. Move the slider back and forth around where $w = 0$ and watch the slope of the secant jump distinctly from zero to one. Because the slopes of the secant lines just to the left and right of the fixed point $w_0 = 0$ fail to line up, the function does not have a derivative here. So try as you might, the line will never turn green.

In [4]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.maximum(w,0)

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200,mark_tangent = False)
Out[4]: (interactive slider widget)



1.3 From secant slope to derivative

With this in mind how can we compute the equation of a tangent line at some point $w_0$ for a given function? More specifically, how can we compute the derivative here - or the slope of this tangent line? Well we know that if we take another point $w_1$ on either side of $w_0$ and connect the two - creating the secant line with equation

\begin{equation} h(w) = g(w_0) + \frac{g(w_1) - g(w_0)}{w_1 - w_0}(w - w_0) \end{equation}

as we push $w_1$ ever closer towards $w_0$ this secant becomes our tangent line when $w_1 \approx w_0$. Now note that $w_1$ appears only in the slope of this equation, hence the slope is the only quantity that changes as $w_1$ gets closer to $w_0$ and the secant line becomes tangent at $w_0$. This is great because now, in our aim to understand the tangent line, we can focus our attention solely on what happens to the slope of the secant - which is precisely the derivative (the slope of the tangent line) that we are after.

Now, remember that the slope of a line measures its steepness, or 'rise over run': the change in its vertical value ($g(w_1) - g(w_0)$) over the change in its horizontal value ($w_1 - w_0$). In other words

\begin{equation} \text{slope of secant line} = \frac{\text{change in $g$}}{\text{change in $w$}} = \frac{g(w_1) - g(w_0)}{w_1 - w_0} \end{equation}

As $w_1$ inches ever closer to $w_0$ - from either the left or the right - the change in both $g$ and $w$ becomes incredibly small, or infinitesimal. And this is how the derivative is conceptually defined: as the slope of a secant line where $w_1$ is so close to $w_0$ that the changes in $g$ and $w$ are both infinitesimal. And remember: the value of this slope needs to be the same whether $w_1$ lies to the left or the right of $w_0$.

The derivative of a function $g$ at a point $w_0$ is the slope of the tangent line there, which in turn is the slope of a secant line where $w_1$ is so close to $w_0$ that both the change in $g$ and the change in $w$ defining the slope are infinitesimally small.

1.4 Refining the definition of the derivative

Let's quantify more explicitly, using math notation, what this definition means - first by backing off the 'infinitesimally small' part for a moment and just making the difference very small. We can define a generic point very close to and to the right of $w_0$ by denoting by $\epsilon$ some small positive number (e.g., $\epsilon = 0.0001$); then the point $w_1 = w_0 + \epsilon$ is indeed quite close to $w_0$. The slope of the secant line connecting $(w_0,g(w_0))$ to $(w_0 + \epsilon, g(w_0 + \epsilon))$ is then given as

\begin{equation} \frac{g(w_1) - g(w_0)}{w_1 - w_0} = \frac{g(w_0 + \epsilon) - g(w_0)}{w_0 + \epsilon - w_0} = \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} \end{equation}

To ensure that this value is indeed close to the derivative value we need to check that the slope of this secant line is very similar to the slope of a secant based at $w_0$ and going through a point slightly to the left of $w_0$. Taking the same value for $\epsilon$ we can take the point $w_0 - \epsilon$ which lies just to the left of $w_0$. Forming the secant connecting points $(w_0, g(w_0))$ and $(w_0 - \epsilon, g(w_0 - \epsilon))$ we can compute its slope as

\begin{equation} \frac{g(w_1) - g(w_0)}{w_1 - w_0} = \frac{g(w_0 - \epsilon) - g(w_0)}{w_0 - \epsilon - w_0} = - \frac{g(w_0 - \epsilon) - g(w_0)}{\epsilon} \end{equation}

If there is indeed a derivative at $w_0$ then the value of this slope needs to closely match the slope of our first secant, or in other words

\begin{equation} \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} \approx - \frac{g(w_0 - \epsilon) - g(w_0)}{\epsilon} \end{equation}

And - moreover - as we make $\epsilon$ smaller and smaller these two quantities should both settle down to one value, and be perfectly equal to each other.

Notice that we can express this more compactly if we let $\epsilon$ represent a small (in magnitude) positive or negative number. Then we can say equivalently that we desire that the quantity

\begin{equation} \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} \end{equation}

settles down as we make $\epsilon$ smaller and smaller in magnitude. We can still think of this more compact formula as representing the slopes of secant lines on either side of $w_0$, which close in on $w_0$ from both sides as we make the magnitude of $\epsilon$ infinitesimally small.

Writing this algebraically we say that we want the value $ \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} $ to converge to a single value as $\vert\epsilon\vert \longrightarrow 0$.
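We can watch this convergence numerically. The sketch below evaluates the quantity above for the sinusoid at $w_0 = 0$ with $\epsilon$ shrinking in magnitude on both sides; the slopes from both sides settle down to the same value (which turns out to be $1$):

```python
import numpy as np

g = lambda w: np.sin(w)
w0 = 0.0

# secant slopes for epsilon shrinking in magnitude, alternating sides of w0
for eps in [0.1, -0.1, 0.01, -0.01, 0.001, -0.001]:
    slope = (g(w0 + eps) - g(w0)) / eps
    print(eps, slope)  # slopes from both sides settle down toward 1
```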

Common notations for the derivative

One common notation used to denote this ratio of infinitesimal changes $\frac{\text{infinitesimal change in $g$}}{\text{infinitesimal change in $w$}}$ is $\frac{\mathrm{d}g}{\mathrm{d}w}$. Here the symbol $\mathrm{d}$ means 'infinitely small change in the value of'. A common variation on this notation puts the $g$ out front, like this $ \frac{\mathrm{d}}{\mathrm{d}w}g$. In short - we have both the definition and symbol to denote a general derivative of $g$ at any point as

\begin{equation} \text{derivative} = \frac{\text{infinitesimal change in $g$}}{\text{infinitesimal change in $w$}}:= \frac{\mathrm{d}g}{\mathrm{d}w} \,\,\, \text{or} \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}g \end{equation}

There are other notations commonly used in practice to denote the derivative, but we will stick to using these.

To denote the derivative at a specific point $w_0$ we will write

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w_0) \end{equation}

Example. Computing approximate derivatives at a point

Take our sinusoid, the point $w_0 = 0$, and a small magnitude value for $\epsilon$ like $\epsilon = 0.0001$. Computing the slope of a secant line where $w_1 = w_0 + \epsilon$ lies just to the right of $w_0$ we have

$$ \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} = \frac{\text{sin}(0.0001)}{0.0001}\approx 0.99999$$

Likewise computing the slope of the secant line where $w_1 = w_0 - \epsilon$ lies just to the left of $w_0$ we have

$$ -\frac{g(w_0 - \epsilon) - g(w_0)}{\epsilon} = -\frac{\text{sin}(-0.0001)}{0.0001}\approx 0.99999$$

Indeed both slopes are approximately equal, so we can definitively say at $w_0 = 0$

$$ \frac{\mathrm{d}}{\mathrm{d}w}g(w_0) \approx 0.99999 $$

Using this we can write out the equation of the tangent line to the sinusoid at $w_0 = 0$ as

$$ h(w) = \text{sin}(0) + 0.99999(w - 0) = 0.99999w$$
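The two one-sided slope computations in this example translate directly into a few lines of code:

```python
import numpy as np

g = lambda w: np.sin(w)
w0, eps = 0.0, 1e-4

# secant slope using a point just to the right of w0
right = (g(w0 + eps) - g(w0)) / eps
# secant slope using a point just to the left of w0
left = -(g(w0 - eps) - g(w0)) / eps

print(right, left)  # both approximately 0.99999
```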

Example. Checking non-differentiability at $w = 0$ for the relu function

Checking differentiability of the relu function

$$ g(w) = \text{max}(0,w) $$

at $w_0 = 0$, we have that the slope of a secant where $w_1 = w_0 + \epsilon$, for any small $\epsilon > 0$ (e.g., $\epsilon = 0.0001$), coming in from the right is

$$ \frac{g(w_0 + \epsilon) - g(w_0)}{\epsilon} = \frac{\text{max}(0,0.0001)}{0.0001}= \frac{0.0001}{0.0001} = 1$$

A similar computation where $w_1 = w_0 - \epsilon$ comes in from the left gives

$$ -\frac{g(w_0 - \epsilon) - g(w_0)}{\epsilon} = -\frac{\text{max}(0,-0.0001)}{0.0001}= -\frac{0}{0.0001} = 0$$

Since these two secant slopes do not match up, the function is not differentiable at $w_0 = 0$, and these computations hold regardless of the magnitude of $\epsilon$.
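The same check in code confirms the mismatch for the relu:

```python
import numpy as np

g = lambda w: np.maximum(0, w)
w0, eps = 0.0, 1e-4

# secant slope coming in from the right: max(0, eps)/eps = 1
right = (g(w0 + eps) - g(w0)) / eps
# secant slope coming in from the left: -max(0, -eps)/eps = 0
left = -(g(w0 - eps) - g(w0)) / eps

print(right, left)  # 1.0 and 0.0 - the slopes disagree, no derivative at w0 = 0
```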

2. Our first derivative calculator: the Numerical Differentiator

In this short section we briefly discuss the first of several methods for calculating derivatives programmatically (in other words, for coding up a derivative calculator): the Numerical Differentiator. To create this calculator we simply code up the definition of the derivative at a point discussed in the previous section.

2.1 Just use the definition

The most straightforward way to build a derivative calculator is to use the definition of the derivative at a point given above, and used extensively in the examples of the previous section. That is, for small (magnitude) $\epsilon$ the value of the derivative of a function $g(w)$ is approximately

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g(w) \approx \frac{ g(w + \epsilon) - g(w)}{\epsilon} \end{equation}

So if we want to make a program that estimates the derivative of some function at a point, why not simply choose a small positive value for $\epsilon$, approximate every derivative we encounter simply as

$$ \frac{ g(w + \epsilon) - g(w)}{\epsilon} $$

and call it a day? This would clearly be extremely easy to code up - a (more or less) one line derivative calculator.

Example. A numerical derivative calculator

In the next Python cell we provide a Python class that simply implements the above numerical definition of the derivative for a user-defined choice of $\epsilon$. Those wanting a good introduction to Python classes, in particular for implementing mathematical functions and objects, can see e.g., this excellent book.

In [5]:
class numerical_derivative:
    '''
    A class for computing the numerical derivative
    of an arbitrary input function, with a user-chosen epsilon
    '''
    def __init__(self, g, **kwargs):
        # load in function to differentiate, set epsilon to desired value or use default
        self.g = g
        self.epsilon = 10**-5
        if 'epsilon' in kwargs:
            self.epsilon = kwargs['epsilon']

    def __call__(self, w):
        # make local copies 
        g, epsilon = self.g, self.epsilon     
        
        # compute derivative approximation and return
        approx = (g(w+epsilon) - g(w))/epsilon
        return approx

Let's check that this class indeed computes accurate derivatives for a simple function whose derivative we can verify visually

$$ g(w) = \text{sin}(w) $$

This elementary function actually has an algebraic formula for its derivative - as we will see in the next post - which is given by $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \text{cos}(w)$.

In the next Python cell we run a fine grid of points on the interval [-5,5] through our Numerical Differentiator and plot the result - along with the original function.

In [6]:
# make function, create derivative
g = lambda w: np.sin(w)
der = numerical_derivative(g,epsilon = 10**-10)

# evaluate the derivative over this range of input
wvals = np.linspace(-5,5,100)
gvals = [g(w) for w in wvals]
dervals = [der(w) for w in wvals]

# plot function and derivative
plt.plot(wvals,gvals,color = 'k',label = 'original function')
plt.plot(wvals,dervals,color = 'r',label = 'numerical derivative') 
plt.legend(bbox_to_anchor=(1.05, 1), loc=2); plt.xlabel('$w$')
plt.show()

Looks good!


But what value of $\epsilon$ can we trust to give us a close approximation to the true derivative value? Well, for functions that do not change very rapidly we can trust a moderately small value - e.g., $\epsilon = 10^{-3}$. But this will not be small enough for more rapidly changing functions.

Example. A rapidly changing approximation to the derivative

Take for example the function defined in the next Python cell. There we use the same slider widget introduced earlier in this post, which illustrates the numerical definition of the derivative for a user-defined function at one particular point - here zero - as the slope of secant lines passing through nearby points. As you move the slider from left to right, focus your attention on the slopes of the secant lines formed using points just around zero: the actual derivative value here is zero, but the approximations (the slopes of secant lines just to the left or right of zero) are both quite large in magnitude.

In [7]:
# what function should we play with?  Defined in the next line, along with our fixed point where we show tangency.
g = lambda w: np.cos(20*w)/(w**2 + 1)

# create an instance of the visualizer with this function
st = calclib.secant_to_tangent.visualizer(g = g)

# run the visualizer for our chosen input function and initial point
st.draw_it(w_init = 0, num_frames = 200)
Out[7]: (interactive slider widget)



To hammer home the point, let's see how well the numerical derivative calculator we made works on this function - using the same settings as before.

In [8]:
# make function, create derivative
g = lambda w: np.cos(20*w)/(w**2 + 1)
der = numerical_derivative(g,epsilon = 10**-10)

# evaluate the derivative over this range of input
wvals = np.linspace(-3,3,300)
gvals = [g(w) for w in wvals]
dervals = [der(w) for w in wvals]

# plot function and derivative
plt.plot(wvals,gvals,color = 'k',label = 'original function')
plt.plot(wvals,dervals,color = 'r',label = 'derivative') 
plt.legend(bbox_to_anchor=(1.05, 1), loc=2); plt.xlabel('$w$')
plt.show()

With the same setting - $\epsilon = 10^{-10}$ - we get a highly inaccurate derivative calculation along almost the entire input space! The numerical derivative can be quite bad for functions that wiggle around a good amount like this one.


The point of this example is that if we want a derivative calculator built on the numerical definition of the derivative to be generally applicable, we should set $\epsilon$ to an extremely small positive number.

But here comes the rub - setting $\epsilon$ too small creates a second problem called round-off error. Numerical values - whether or not they are produced from a mathematical function - can only be represented up to a certain accuracy on a computer. In particular, we always have a tough time representing fractional numbers $\frac{a}{b}$ where both $a$ and $b$ are close to zero. But - as we make $\epsilon$ small - this is precisely what becomes of the approximation

$$ \frac{ g(w + \epsilon) - g(w)}{\epsilon} $$

since both the top (since the values $g(w + \epsilon)$ and $g(w)$ become essentially identical) and bottom of this fraction become incredibly small as we shrink the value of $\epsilon$.
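The heart of the problem can be distilled into a single comparison: for a small enough $\epsilon$ the floating point sum $w + \epsilon$ rounds back to $w$ itself, making the numerator exactly zero. A sketch, assuming standard 64-bit floats:

```python
import numpy as np

w, eps = 0.5, 1e-17

# w + eps rounds to w in 64-bit floating point: the spacing of
# floats near 0.5 is about 1.1e-16, much larger than eps
print(w + eps == w)                 # True
print(np.sin(w + eps) - np.sin(w))  # exactly 0.0
```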

Example. Round-off errors in calculating the derivative at a point

Take for example [1] the approximation to the derivative of $g(w) = \text{sin}(w)$ at the point $w = 0.5$. From the derivative rules listed in Table 2 of the previous post we know that $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = \text{cos}(w)$ here, and at $w = 0.5$ the value of cosine is $\text{cos}(0.5) \approx 0.8775825619$.

In the next cell we evaluate the derivative approximation for a range of very small values of $\epsilon$ and plot the result. As can be seen in the associated plot, the approximations oscillate over a range of values that are surprisingly inaccurate given the small values of $\epsilon$ chosen (e.g., the value $1$ is reached quite often), and moreover the value of the approximation actually becomes zero after a certain value of $\epsilon$ - at around $10^{-17}$ - and never recovers. Indeed the value $0$ is a poor approximation to the true derivative value $0.8775825619$, and this occurs entirely because of the inherent problem with representing ratios of small values (round-off error) on a computer.

In [9]:
# function
der = lambda w,epsilon: ((np.sin(w + epsilon) - np.sin(w)) /(epsilon))

# look over a range of epsilon values and point to approximate derivative at
eps_range = np.linspace(10**-15,10**-20,100); w = 0.5
der_vals = der(w,eps_range)

# plot
eps_range.shape = (len(eps_range),1)
der_vals.shape = (len(der_vals),1)
table = np.concatenate((eps_range,der_vals),axis=1)
baslib.basics_plotter.single_plot(table = table,xlabel = r'$\epsilon$',ylabel = 'approximation',rotate_ylabel = 90)

This sort of behavior is not an isolated phenomenon particular to this example - it occurs quite generally.

Summary and what's next

In this post we saw how to define the derivative at a point for any generic function. We also saw how to immediately implement this definition as a way of calculating derivatives numerically - making the so-called Numerical Differentiator. This calculator was shown to be less than ideal (if it were perfect we might have just called it a day), and in the next post we will dig deeper into how to calculate derivatives of elementary functions - leading to a much more accurate way of computing derivatives numerically.

The content of this notebook is supplementary material for the textbook Machine Learning Refined (Cambridge University Press, 2016). Visit http://mlrefined.com for free chapter downloads and tutorials, and our Amazon site for details regarding a hard copy of the text.